Regression Diagnostics: outliers, leverage, and influence; added variable and marginal model plots
After this lab you will be able to:
- assess and diagnose the extent to which outlying observations are driving your results
- assess the impact of a given variable within a multiple regression model
This lab uses materials by:
- Angela Dixon
- Andrew Bray
- Andrew Bray and Mine Cetinkaya-Rundel
- Brian Caffo, Jeff Leek, Roger Peng
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
Let’s take a look at the usage of “rail trails,” which are trail systems built on old rail grades; we will explore the relationship between temperature and ridership. Load in the RailTrail data set from the mosaicData package:
library(mosaicData)
head(RailTrail)
## hightemp lowtemp avgtemp spring summer fall cloudcover precip volume
## 1 83 50 66.5 0 1 0 7.6 0.00 501
## 2 73 49 61.0 0 1 0 6.3 0.29 419
## 3 74 52 63.0 1 0 0 7.5 0.32 397
## 4 95 61 78.0 0 1 0 2.6 0.00 385
## 5 44 52 48.0 1 0 0 10.0 0.14 200
## 6 69 54 61.5 1 0 0 6.6 0.02 375
## weekday
## 1 1
## 2 1
## 3 1
## 4 0
## 5 1
## 6 1
Create a scatterplot of daily ridership (volume) against the high temperature that day (hightemp). What type of relationship do you observe? Does a linear model at least appear worth trying? What type of relationship did you expect?
Fit a bivariate regression of volume on hightemp
mod.rider <- lm(volume~hightemp,data=RailTrail)
This regresses the volume of ridership on the high temperature hightemp. Test whether the conditions for regression appear reasonable:
Linearity: You already checked if the relationship between ridership and temperature is linear using a scatterplot. We should also verify this condition with a plot of the residuals vs. temperature.
plot(mod.rider$residuals ~ RailTrail$hightemp)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
Nearly normal residuals: To check this condition, we can look at a histogram
hist(mod.rider$residuals)
or a normal probability plot of the residuals.
qqnorm(mod.rider$residuals)
qqline(mod.rider$residuals) # adds diagonal line to the normal prob plot
Constant variability: There are formal tests for heteroskedasticity, but plots generally serve as a rough heuristic, at least during preliminary fitting. Constant variability means that the variability of points around the regression line remains consistent for all values of X. Thus, we again use the plot of residuals against the independent variable to check this:
plot(mod.rider$residuals ~ RailTrail$hightemp)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
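As an aside, one of the formal heteroskedasticity tests mentioned above is the score test for non-constant variance implemented as ncvTest() in the car package. A sketch on the same model (this assumes car and mosaicData are installed):

```r
library(car)          # provides ncvTest()
library(mosaicData)   # RailTrail data
mod.rider <- lm(volume ~ hightemp, data = RailTrail)
ncvTest(mod.rider)    # small p-value suggests non-constant variance
```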
Outliers are points that don’t fit the trend in the rest of the data.
High leverage points have the potential to have an unusually large influence on the fitted model.
Influential points are high leverage points that cause a very different line to be fit than would be with that point removed.
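These definitions can be seen in a toy illustration (simulated data, not RailTrail): a single high-leverage point that breaks the trend drags the fitted slope far from the truth.

```r
set.seed(1)
x <- c(runif(20), 5)                       # last point is far out in x (high leverage)
y <- c(2 * x[1:20] + rnorm(20, sd = 0.2),  # true slope is 2 for the first 20 points
       0)                                  # the 21st point defies the trend
coef(lm(y ~ x))[2]            # slope pulled well below 2 by the influential point
coef(lm(y[-21] ~ x[-21]))[2]  # slope close to 2 once that point is removed
```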
Recall that there are numerous ways of assessing the influence that a given observation has on your model…
Run ?influence.measures to see the full suite of influence measures in stats. The measures include:
- rstandard - standardized residuals (residuals divided by their standard deviations)
- rstudent - standardized residuals where the ith data point was deleted when calculating the standard deviation for that residual, so that it follows a t distribution
- hatvalues - measures of leverage
- dffits - change in the predicted response when the \(i^{th}\) point is deleted in fitting the model
- dfbetas - change in individual coefficients when the \(i^{th}\) point is deleted in fitting the model
- cooks.distance - overall change in the coefficients when the \(i^{th}\) point is deleted
- resid - returns the ordinary residuals
- resid(fit) / (1 - hatvalues(fit)), where fit is the linear model fit, returns the PRESS residuals, i.e. the leave-one-out cross-validation residuals: the difference between the response and the predicted response at data point \(i\), where it was not included in the model fitting.

One of the more common tools is standardized residuals. Standardized residuals are essentially “z-score residuals”, so if you are using a \(95\%\) confidence level, you can take a quick look to see if any standardized residuals are greater than 2 (or 1.96) or less than -2 (-1.96):
plot(rstandard(mod.rider)~ RailTrail$hightemp)
abline(h = 2, lty = 3,col='red') # adds a horizontal dashed line at y = 2
abline(h = -2, lty = 3,col='red') # adds a horizontal dashed line at y = -2
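The PRESS identity mentioned above can be verified directly. A sketch using built-in data (mtcars, so the example is self-contained): resid(fit) / (1 - hatvalues(fit)) equals the leave-one-out residual obtained by refitting without the point.

```r
fit <- lm(mpg ~ wt, data = mtcars)
press1 <- resid(fit)[1] / (1 - hatvalues(fit)[1])
fit_loo <- lm(mpg ~ wt, data = mtcars[-1, ])            # refit without observation 1
loo1 <- mtcars$mpg[1] - predict(fit_loo, newdata = mtcars[1, ])
all.equal(unname(press1), unname(loo1))                 # TRUE
```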
Studentized residuals are just like standardized residuals, but we leave out a given point when computing the sd. That way, the very point you are trying to analyze does not factor into the standardization. In all likelihood, in most cases you won’t notice much difference between the two.
plot(rstudent(mod.rider)~ RailTrail$hightemp)
abline(h = 2, lty = 3,col='red') # adds a horizontal dashed line at y = 2
abline(h = -2, lty = 3,col='red') # adds a horizontal dashed line at y = -2
We can assess leverage by plotting hatvalues….
hat <- hatvalues(mod.rider)
plot(hat)
dfbetas and Cook’s Distance are both common ways to measure influence:
dfbetas are the difference in \(\beta_k\) when observation \(i\) is left out of the model. We care about how each point influences \(\beta_1\) in this case, not \(\beta_0\), so we’ll only plot the second column of results.
modbetas = dfbetas(mod.rider)
plot(modbetas[,2])
Cook’s Distance is an alternative measure: a summary metric that captures the total change in all model parameters due to a given observation. For this reason, there is no absolute standard for what counts as a “large” or “small” Cook’s distance. The most general criterion is that a point is highly influential when \(D_i > 1\). However, the number of observations directly limits how much influence any one point can have, independent of its values, so \(D_i > 4/n\) is also a common criterion.
modcooks = cooks.distance(mod.rider)
plot(modcooks)
abline(h=4/nrow(RailTrail),col='red')
It’s tough to know how useful a bubble plot is, but it’s fun to make!
plot(hatvalues(mod.rider), rstudent(mod.rider), type='n')
cook <- sqrt(cooks.distance(mod.rider))
points(hatvalues(mod.rider), rstudent(mod.rider), cex=10*cook/max(cook))
abline(h=c(-2,0,2), lty=2)
abline(v = c(2, 3) * length(coef(mod.rider)) / nrow(RailTrail), lty = 2) # rough leverage cutoffs at 2p/n and 3p/n
Robust regression downweights influential data:
library(MASS)
mod.rider.robust = rlm(volume~hightemp,data=RailTrail)
summary(mod.rider.robust)
##
## Call: rlm(formula = volume ~ hightemp, data = RailTrail)
## Residuals:
## Min 1Q Median 3Q Max
## -253.802 -55.176 8.995 58.812 314.594
##
## Coefficients:
## Value Std. Error t value
## (Intercept) -25.3436 56.8986 -0.4454
## hightemp 5.8019 0.8124 7.1421
##
## Residual standard error: 86.29 on 88 degrees of freedom
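rlm fits by iteratively reweighted least squares, and the final weights are stored in the fitted object; observations with weight below 1 were downweighted. A sketch (assuming mosaicData and MASS are installed):

```r
library(MASS)
library(mosaicData)
fit_r <- rlm(volume ~ hightemp, data = RailTrail)
head(sort(fit_r$w))   # the most heavily downweighted observations
```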
You could also delete problem points, but I would strongly recommend avoiding this if at all possible, unless you know with great confidence that a data point is an error and not simply an outlying observation.
library(MASS)
mod.rider.delete = lm(volume~hightemp,data=RailTrail[cooks.distance(mod.rider)<=4/nrow(RailTrail),])
summary(mod.rider.delete)
##
## Call:
## lm(formula = volume ~ hightemp, data = RailTrail[cooks.distance(mod.rider) <=
## 4/nrow(RailTrail), ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -262.83 -57.10 3.22 54.76 268.79
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -69.656 57.118 -1.220 0.226
## hightemp 6.513 0.828 7.866 1.12e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 94.64 on 84 degrees of freedom
## Multiple R-squared: 0.4242, Adjusted R-squared: 0.4173
## F-statistic: 61.87 on 1 and 84 DF, p-value: 1.124e-11
Residual plots are somewhat problematic for multiple regression, because we have many different input variables. Instead, we will use “added variable plots”.
In simple linear regression we use residual plots to assess linearity and constant variance.
We fit the model:
\[ y \sim x_1 + x_2 \]
If this were a bivariate model, we could conclude that the mean function looks fairly linear but the errors appear to have increasing variance. However, these are fake data generated from a model with constant variance!
In MLR, in general, you cannot assume that the structure you see in the residuals vs. fitted plot is the structure that was misspecified.
The only conclusion you can draw is that something is misspecified.
So now what?
Although several types of invalid models can create non-constant variance in the residuals, a valid model will always produce structureless residuals.
If you can be sure you have a good mean function, then the residual plot is more informative.
The objective of constructing an added variable plot is to assess how much each variable adds to your model.
Consider some data concerning restaurants in NYC, for which we’d like to build the model:
\[ Price \sim Food + Decor + Service + East \]
We can assess the isolated effect of each predictor on the response with a series of simple scatterplots…
This might be more efficient…
pairs(Price ~ Food + Decor + Service + East, data = nyc)
But this does not provide a way to look at a variable net of other variables. Instead, an added variable plot tells you how much a given predictor \(x_i\) can explain the response after the other predictors have been taken into account. An “av-plot” has:
On the y-axis, the residuals from the model predicting the response without \(x_i\).
On the x-axis, the residuals from predicting \(x_i\) using those same predictors.
First, get the residuals from the model
\[ Price \sim Decor + Service + East \]
resY <- lm(Price ~ Decor + Service + East, data = nyc)$res
Second, get the residuals from the model
\[ Food \sim Decor + Service + East \]
resX <- lm(Food ~ Decor + Service + East, data = nyc)$res
Then plot them against each other…
plot(resY ~ resX)
The car package has an avPlots() function that does this for you…
library(car)
m1 <- lm(Price ~ Food + Decor + Service + East, data = nyc)
avPlot(m1,variable = "Food")
avPlots(m1)
Notice that if we fit a line through the AVP, the slope should look familiar…
AVPm1 <- lm(resY ~ resX)
AVPm1$coef
## (Intercept) resX
## 5.07403e-17 1.53812e+00
m1$coef
## (Intercept) Food Decor Service East
## -24.023799670 1.538119941 1.910087113 -0.002727483 2.068050156
AVPs can be used to assess whether it makes sense to include an additional variable in the model (similar to looking at the p-value of the predictor).
They’re a bit more informative, though, since they would also indicate if the relationship between that predictor and the response is linear in the context of the other variables.
Let’s look at home prices in LA…
In the LA data set, a scatterplot of price against square footage suggests two influential points, but are they influential in a multiple regression model?
LA <- read.csv("http://andrewpbray.github.io/data/LA.csv")
m1 <- lm(price ~ sqft + bed + city, data = LA)
levs <- influence(m1)$hat
tail(sort(levs))
## 1255 1289 1277 1293 1294 1292
## 0.04202814 0.04301286 0.04826781 0.07888082 0.10423838 0.15779785
e_hat <- m1$res
s <- sqrt(1/(nrow(LA) - length(m1$coef)) * sum(e_hat^2))
r <- e_hat/(s * sqrt(1 - levs))
head(r)
## 1 2 3 4 5 6
## -0.1251458 -0.1449723 -0.1155992 -0.5684922 -1.0500985 0.2047673
head(rstandard(m1))
## 1 2 3 4 5 6
## -0.1251458 -0.1449723 -0.1155992 -0.5684922 -1.0500985 0.2047673
hist(r)
tail(sort(r))
## 1193 1287 1288 1096 1294 1292
## 3.579264 3.664149 3.664149 3.744932 14.659810 27.695804
cdist <- (r^2 / length(m1$coef)) * (levs/(1 - levs))
tail(sort(cdist))
## 1254 1260 1291 1289 1294 1292
## 0.05408505 0.05847561 0.08015348 0.09034382 4.16812407 23.95308503
plot(m1, 5)
Calculate the Cook’s distance of those two observations.
Generate the Cook’s distance plot.
Look at the added variable plots:
library(car)
avPlots(m1)
library(dplyr)
LA <- mutate(LA, logprice = log(price), logsqft = log(sqft))
m2 <- lm(logprice ~ logsqft + bed + city, data = LA)
summary(m2)
##
## Call:
## lm(formula = logprice ~ logsqft + bed + city, data = LA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.26020 -0.24897 -0.01613 0.21804 1.37277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.13068 0.21201 24.200 <2e-16 ***
## logsqft 1.20729 0.03036 39.769 <2e-16 ***
## bed -0.03010 0.01284 -2.345 0.0191 *
## cityLong Beach -0.88280 0.03467 -25.464 <2e-16 ***
## citySanta Monica -0.09416 0.04022 -2.341 0.0194 *
## cityWestwood -0.46244 0.04876 -9.484 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3704 on 1588 degrees of freedom
## Multiple R-squared: 0.8742, Adjusted R-squared: 0.8738
## F-statistic: 2206 on 5 and 1588 DF, p-value: < 2.2e-16
avPlots(m2)
Overall, this model should look quite a bit better.
par(mfrow=c(2,2))
plot(m2)
defects <- read.table("http://www.stat.tamu.edu/~sheather/book/docs/datasets/defects.txt",
header = TRUE)
head(defects)
## Case Temperature Density Rate Defective
## 1 1 0.97 32.08 177.7 0.2
## 2 2 2.85 21.14 254.1 47.9
## 3 3 2.95 20.65 272.6 50.9
## 4 4 2.84 22.53 273.4 49.7
## 5 5 1.84 27.43 210.8 11.0
## 6 6 2.05 25.42 236.1 15.6
Let’s look at the pairwise comparisons…
pairs(Defective ~ Temperature + Density + Rate, data = defects)
Here’s our basic model…
\[ \widehat{Defective} \sim Temperature + Density + Rate \]
m1 <- lm(Defective ~ Temperature + Density + Rate, data = defects)
summary(m1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.3243900 65.9264798 0.1566046 0.87676619
## Temperature 16.0779089 8.2941050 1.9384742 0.06349394
## Density -1.8272927 1.4970684 -1.2205807 0.23319949
## Rate 0.1167322 0.1306268 0.8936312 0.37971687
View a summary of model diagnostics:
View residuals plotted against each covariate…
A plot of the response against the fitted values is used to assess whether your model is doing a good job of modeling the response. If it is, you’ll see points along the identity line. If it’s not, there will be non-linear structure; try to correct it by transforming the response, and assess the fit on a predictor-by-predictor basis using marginal model plots.
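A minimal sketch of this plot using built-in data (mtcars, so the example is self-contained):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
plot(mtcars$mpg ~ fitted(fit))   # response vs. fitted values
abline(a = 0, b = 1, lty = 2)    # identity line: points should track this
```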
The standardized residual plots and the plot of \(y\) on \(\hat{y}\) suggest that something is amiss, but what? We need to be sure that the structure in the data is being mirrored well by the structure in our model. This comparison is made for each predictor using the marginal model plot.
Marginal model plots are used to assess the marginal relationship between each predictor and the response.
- Each plot compares the fit from the model with a nonparametric fit to the scatterplot.
- If your model is well-specified, these two lines will be close to coincident.
You can build them by hand using loess() or use mmp() in the car package.
Now, load the car package (if you haven’t already) and produce the marginal model plots…
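car’s mmps() draws one marginal model plot per predictor (plus one against the fitted values). Shown here on built-in data for a self-contained sketch; the same call applies to the defects model:

```r
library(car)                          # provides mmp() / mmps()
fit <- lm(mpg ~ wt + hp, data = mtcars)
mmps(fit)                             # marginal model plots for every predictor
```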
Or you can use loess to make these by hand:
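A by-hand sketch (again on built-in data): compare a loess smooth of the raw response with a loess smooth of the model’s fitted values, both plotted against the same predictor.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
ord <- order(mtcars$wt)
plot(mpg ~ wt, data = mtcars)
lines(mtcars$wt[ord], fitted(loess(mpg ~ wt, data = mtcars))[ord],
      col = "blue")                   # nonparametric fit to the data
lines(mtcars$wt[ord], fitted(loess(fitted(fit) ~ wt, data = mtcars))[ord],
      col = "red", lty = 2)           # nonparametric fit to the model's fitted values
```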
\[ \widehat{\sqrt{Defective}} \sim Temperature + Density + Rate \]
defects <- transform(defects, sqrtDefective = sqrt(Defective))
m2 <- lm(sqrtDefective ~ Temperature + Density + Rate, data = defects)
summary(m2)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.59297089 5.26400977 1.062492 0.29778175
## Temperature 1.56516337 0.66225665 2.363379 0.02586654
## Density -0.29166383 0.11953593 -2.439968 0.02181531
## Rate 0.01289857 0.01043012 1.236666 0.22726602
Marginal model plots for the second model:
How do these look?